
Neural Information Processing Systems

We thank all reviewers for their helpful comments and provide our response below (reviewers' comments are italicized). We will also apply the changes suggested by the reviewers to improve the overall presentation. We would like to clarify/correct a few key points in the reviewers' comments, with the hope that this will also help clarify the PPML schemes that we compare with (and build upon). Poisoning attacks are certainly an interesting future direction. Please note that, even in this domain, the problem is already very challenging and an active area of research.


Load-Aware Training Scheduling for Model Circulation-based Decentralized Federated Learning

Kainuma, Haruki, Nishio, Takayuki

arXiv.org Artificial Intelligence

This paper proposes Load-aware Tram-FL, an extension of Tram-FL that introduces a training scheduling mechanism to minimize total training time in decentralized federated learning by accounting for both computational and communication loads. The scheduling problem is formulated as a global optimization task, which, though intractable in its original form, is made solvable by decomposing it into node-wise subproblems. To promote balanced data utilization under non-IID distributions, a variance constraint is introduced, while the overall training latency, including both computation and communication costs, is minimized through the objective function. Simulation results on MNIST and CIFAR-10 demonstrate that Load-aware Tram-FL significantly reduces training time and accelerates convergence compared to baseline methods. Federated learning (FL) enables model training without exporting data, making it particularly effective for privacy-sensitive applications.
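As a rough illustration of the node-wise scheduling idea described above, the sketch below greedily picks the next training node by its estimated computation-plus-communication latency while keeping visit counts balanced; the node names, cost tables, and variance threshold are illustrative assumptions, not the authors' formulation.

```python
import numpy as np

def next_node(current, nodes, visits, comp_time, link_time, var_limit):
    """Greedy node-wise schedule: pick the node that minimizes the estimated
    per-round latency (local computation + model transfer from the current
    node), while keeping the spread of visit counts below var_limit so that
    non-IID data from all nodes is used in a balanced way."""
    best, best_cost = None, float("inf")
    for n in nodes:
        if n == current:
            continue
        # Hypothetical variance constraint: skip nodes whose selection would
        # make the visit-count distribution too imbalanced.
        trial = visits.copy()
        trial[n] += 1
        if np.var(list(trial.values())) > var_limit:
            continue
        cost = comp_time[n] + link_time[current][n]  # computation + communication
        if cost < best_cost:
            best, best_cost = n, cost
    return best

# Illustrative usage with made-up per-node loads.
nodes = ["A", "B", "C"]
visits = {"A": 1, "B": 0, "C": 0}
comp_time = {"A": 2.0, "B": 1.0, "C": 3.0}
link_time = {"A": {"B": 0.5, "C": 2.0}, "B": {"A": 0.5, "C": 1.0}, "C": {"A": 2.0, "B": 1.0}}
print(next_node("A", nodes, visits, comp_time, link_time, var_limit=1.0))
```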


Benchmarking learned algorithms for computed tomography image reconstruction tasks

Kiss, Maximilian B., Biguri, Ander, Shumaylov, Zakhar, Sherry, Ferdia, Batenburg, K. Joost, Schönlieb, Carola-Bibiane, Lucka, Felix

arXiv.org Artificial Intelligence

Computed tomography (CT) is a widely used non-invasive diagnostic method in various fields, and recent advances in deep learning have led to significant progress in CT image reconstruction. However, the lack of large-scale, open-access datasets has hindered the comparison of different types of learned methods. To address this gap, we use the 2DeteCT dataset, a real-world experimental computed tomography dataset, for benchmarking machine learning-based CT image reconstruction algorithms. We categorize these methods into post-processing networks, learned/unrolled iterative methods, learned regularizer methods, and plug-and-play methods, and provide a pipeline for easy implementation and evaluation. Using key performance metrics, including SSIM and PSNR, our benchmarking results showcase the effectiveness of various algorithms on tasks such as full data reconstruction, limited-angle reconstruction, sparse-angle reconstruction, low-dose reconstruction, and beam-hardening corrected reconstruction. With this benchmarking study, we provide an evaluation of a range of algorithms representative of different categories of learned reconstruction methods on a recently published dataset of real-world experimental CT measurements. The reproducible setup of methods and CT image reconstruction tasks in an open-source toolbox enables straightforward addition and comparison of new methods later on. The toolbox also provides the option to load the 2DeteCT dataset differently for extensions to other problems and different CT reconstruction tasks.
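For concreteness, the benchmark's key metrics can be computed per reconstructed slice roughly as below; this is a minimal sketch using scikit-image's standard PSNR/SSIM implementations, with a synthetic slice standing in for actual 2DeteCT data and whatever normalization the toolbox applies.

```python
import numpy as np
from skimage.metrics import peak_signal_noise_ratio, structural_similarity

def evaluate_reconstruction(reference, reconstruction):
    """Compute the two key benchmark metrics (PSNR and SSIM) for one slice.
    The data_range is taken from the reference image; in practice it would
    follow whatever normalization the benchmark pipeline applies."""
    data_range = float(reference.max() - reference.min())
    psnr = peak_signal_noise_ratio(reference, reconstruction, data_range=data_range)
    ssim = structural_similarity(reference, reconstruction, data_range=data_range)
    return {"PSNR": psnr, "SSIM": ssim}

# Illustrative usage with a synthetic slice and a noisy "reconstruction".
rng = np.random.default_rng(0)
gt = rng.random((128, 128)).astype(np.float32)
rec = gt + 0.05 * rng.standard_normal((128, 128)).astype(np.float32)
print(evaluate_reconstruction(gt, rec))
```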


YOSO: You-Only-Sample-Once via Compressed Sensing for Graph Neural Network Training

Li, Yi, Guo, Zhichun, Li, Guanpeng, Li, Bingzhe

arXiv.org Artificial Intelligence

Graph neural networks (GNNs) have become essential tools for analyzing non-Euclidean data across various domains. During the training stage, sampling plays an important role in reducing latency by limiting the number of nodes processed, particularly in large-scale applications. However, as the demand for better prediction performance grows, existing sampling algorithms become increasingly complex, leading to significant overhead. To mitigate this, we propose YOSO (You-Only-Sample-Once), an algorithm designed to achieve efficient training while preserving prediction accuracy. YOSO introduces a compressed sensing (CS)-based sampling and reconstruction framework, where nodes are sampled once at the input layer, followed by a lossless reconstruction at the output layer in each epoch. By integrating the reconstruction process with the loss function of the specific learning task, YOSO not only avoids costly computations in traditional compressed sensing methods, such as orthonormal basis calculations, but also ensures high-probability accuracy retention equivalent to full node participation. Experimental results on node classification and link prediction demonstrate the effectiveness and efficiency of YOSO, reducing GNN training time by an average of 75% compared to state-of-the-art methods, while maintaining accuracy on par with top-performing baselines.
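The compressed-sensing idea underlying the sample-once/reconstruct step can be illustrated with the toy round trip sketched below: a sparse signal over the nodes is observed through a small number of random measurements and then recovered. The measurement matrix, sparsity level, and use of orthogonal matching pursuit are illustrative assumptions; YOSO itself folds reconstruction into the task loss rather than solving a classical recovery problem.

```python
import numpy as np
from sklearn.linear_model import OrthogonalMatchingPursuit

# Toy compressed-sensing round trip: a sparse signal over n nodes is observed
# through m << n random measurements (the one-time "sampling"), then recovered.
rng = np.random.default_rng(0)
n_nodes, n_measurements, sparsity = 200, 60, 5

signal = np.zeros(n_nodes)
signal[rng.choice(n_nodes, sparsity, replace=False)] = rng.standard_normal(sparsity)

phi = rng.standard_normal((n_measurements, n_nodes)) / np.sqrt(n_measurements)
measurements = phi @ signal          # sample once at the "input layer"

omp = OrthogonalMatchingPursuit(n_nonzero_coefs=sparsity, fit_intercept=False)
omp.fit(phi, measurements)           # classical CS recovery (YOSO replaces this
recovered = omp.coef_                # step with a task-loss-integrated version)

print("max reconstruction error:", np.abs(recovered - signal).max())
```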


Distributed Deep Reinforcement Learning Based Gradient Quantization for Federated Learning Enabled Vehicle Edge Computing

Zhang, Cui, Zhang, Wenjun, Wu, Qiong, Fan, Pingyi, Fan, Qiang, Wang, Jiangzhou, Letaief, Khaled B.

arXiv.org Artificial Intelligence

Federated Learning (FL) can protect the privacy of vehicles in vehicle edge computing (VEC) to a certain extent by sharing the gradients of vehicles' local models instead of local data. The gradients of vehicles' local models are usually large for vehicular artificial intelligence (AI) applications, so transmitting such large gradients causes large per-round latency. Gradient quantization has been proposed as an effective approach to reduce the per-round latency in FL-enabled VEC by compressing gradients and reducing the number of bits, i.e., the quantization level, used to transmit them. The selection of the quantization level and thresholds determines the quantization error, which further affects the model accuracy and training time. As a result, the total training time and quantization error (QE) become two key metrics for FL-enabled VEC, and it is critical to optimize them jointly. However, time-varying channel conditions make this problem more challenging to solve. In this paper, we propose a distributed deep reinforcement learning (DRL)-based quantization level allocation scheme to optimize the long-term reward in terms of the total training time and QE. Extensive simulations identify the optimal weighting factors between the total training time and QE, and demonstrate the feasibility and effectiveness of the proposed scheme.
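To make the trade-off between quantization level and quantization error concrete, the sketch below implements a generic unbiased stochastic quantizer (QSGD-style) and reports the resulting error for different numbers of levels; the quantizer and example values are illustrative assumptions rather than the allocation scheme proposed in the paper.

```python
import numpy as np

def stochastic_quantize(grad, levels):
    """Generic stochastic gradient quantizer (QSGD-style): each coordinate is
    rounded to one of `levels` uniform levels of |grad|/||grad||, randomly up
    or down so the quantizer is unbiased. Fewer levels means fewer bits per
    coordinate to transmit but a larger quantization error."""
    norm = np.linalg.norm(grad)
    if norm == 0:
        return grad.copy(), 0.0
    scaled = np.abs(grad) / norm * levels
    lower = np.floor(scaled)
    prob_up = scaled - lower
    quantized_mag = (lower + (np.random.random(grad.shape) < prob_up)) / levels
    q = np.sign(grad) * norm * quantized_mag
    return q, float(np.linalg.norm(q - grad))  # quantized gradient and its QE

grad = np.random.randn(1000)
for levels in (2, 8, 64):
    _, qe = stochastic_quantize(grad, levels)
    print(f"levels={levels:3d}  quantization error={qe:.4f}")
```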


Analysis of Distributed Optimization Algorithms on a Real Processing-In-Memory System

Rhyner, Steve, Luo, Haocong, Gómez-Luna, Juan, Sadrosadati, Mohammad, Jiang, Jiawei, Olgun, Ataberk, Gupta, Harshita, Zhang, Ce, Mutlu, Onur

arXiv.org Artificial Intelligence

Machine Learning (ML) training on large-scale datasets is a very expensive and time-consuming workload. Processor-centric architectures (e.g., CPU, GPU) commonly used for modern ML training workloads are limited by the data movement bottleneck caused by repeatedly accessing the training dataset. As a result, processor-centric systems suffer from performance degradation and high energy consumption. Processing-In-Memory (PIM) is a promising solution to alleviate the data movement bottleneck by placing the computation mechanisms inside or near memory. Our goal is to understand the capabilities and characteristics of popular distributed optimization algorithms on real-world PIM architectures to accelerate data-intensive ML training workloads. To this end, we 1) implement several representative centralized distributed optimization algorithms on UPMEM's real-world general-purpose PIM system, 2) rigorously evaluate these algorithms for ML training on large-scale datasets in terms of performance, accuracy, and scalability, 3) compare to conventional CPU and GPU baselines, and 4) discuss implications for future PIM hardware and the need to shift to an algorithm-hardware codesign perspective to accommodate decentralized distributed optimization algorithms. Our results demonstrate three major findings: 1) modern general-purpose PIM architectures can be a viable alternative to state-of-the-art CPUs and GPUs for many memory-bound ML training workloads when operations and datatypes are natively supported by PIM hardware, 2) carefully choosing the optimization algorithm that best fits PIM is important, and 3) contrary to popular belief, contemporary PIM architectures do not scale approximately linearly with the number of nodes for many data-intensive ML training workloads. To facilitate future research, we aim to open-source our complete codebase.
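The class of centralized distributed optimization algorithms evaluated here can be pictured with the minimal host-coordinated mini-batch SGD sketch below, where each data shard plays the role of a PIM node's local memory; the linear-regression objective, shard layout, and learning rate are illustrative assumptions, not the UPMEM implementation.

```python
import numpy as np

def distributed_sgd_step(w, shards, lr):
    """One step of host-coordinated (centralized) mini-batch SGD for linear
    regression: each "PIM node" computes a gradient on its local data shard,
    and the host averages the partial gradients and updates the model."""
    grads = []
    for X, y in shards:                      # in hardware, these run in parallel on the DPUs
        residual = X @ w - y
        grads.append(X.T @ residual / len(y))
    return w - lr * np.mean(grads, axis=0)   # host-side reduction and update

# Illustrative usage: synthetic least-squares problem split over 4 "nodes".
rng = np.random.default_rng(0)
X, true_w = rng.standard_normal((4000, 10)), rng.standard_normal(10)
y = X @ true_w
shards = [(X[i::4], y[i::4]) for i in range(4)]
w = np.zeros(10)
for _ in range(200):
    w = distributed_sgd_step(w, shards, lr=0.1)
print("parameter error:", np.linalg.norm(w - true_w))
```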


Multi-Agent Based Transfer Learning for Data-Driven Air Traffic Applications

Deng, Chuhao, Choi, Hong-Cheol, Park, Hyunsang, Hwang, Inseok

arXiv.org Artificial Intelligence

Research on developing data-driven models for Air Traffic Management (ATM) has gained tremendous interest in recent years. However, data-driven models are known to have long training times and to require large datasets to achieve good performance. To address these two issues, this paper proposes a Multi-Agent Bidirectional Encoder Representations from Transformers (MA-BERT) model, which fully considers the multi-agent characteristic of the ATM system and learns air traffic controllers' decisions, together with a pre-training and fine-tuning transfer learning framework. By pre-training MA-BERT on a large dataset from a major airport and then fine-tuning it to other airports and specific air traffic applications, a large amount of the total training time can be saved. In addition, for newly adopted procedures and newly constructed airports where no historical data is available, this paper shows that the pre-trained MA-BERT can achieve high performance through regular updates with little data. The proposed transfer learning framework and MA-BERT are tested with automatic dependent surveillance-broadcast data recorded at three airports in South Korea in 2019.
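A minimal pre-train/fine-tune skeleton of the kind the framework relies on is sketched below; the Encoder and Head modules, the checkpoint path, and the synthetic batch are placeholders standing in for the MA-BERT backbone, an application-specific output layer, and the small target-airport dataset.

```python
import torch
import torch.nn as nn

# Generic pre-train / fine-tune skeleton. `Encoder` and `Head` stand in for the
# shared backbone and an application-specific output layer; the real model,
# checkpoint path, and data loaders are assumptions for illustration only.
class Encoder(nn.Module):
    def __init__(self, dim=64):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(8, dim), nn.ReLU(), nn.Linear(dim, dim))
    def forward(self, x):
        return self.net(x)

encoder, head = Encoder(), nn.Linear(64, 2)
# encoder.load_state_dict(torch.load("pretrained_major_airport.pt"))  # hypothetical checkpoint

for p in encoder.parameters():          # keep the pre-trained representation fixed...
    p.requires_grad = False
optimizer = torch.optim.Adam(head.parameters(), lr=1e-3)  # ...and adapt only the head

x, y = torch.randn(32, 8), torch.randint(0, 2, (32,))     # stand-in for the small target dataset
for _ in range(10):
    loss = nn.functional.cross_entropy(head(encoder(x)), y)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
print("fine-tuned loss:", float(loss))
```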


Integrated Sensing, Computation, and Communication for UAV-assisted Federated Edge Learning

Tang, Yao, Zhu, Guangxu, Xu, Wei, Cheung, Man Hon, Lok, Tat-Ming, Cui, Shuguang

arXiv.org Artificial Intelligence

Federated edge learning (FEEL) enables privacy-preserving model training through periodic communication between edge devices and the server. Unmanned Aerial Vehicle (UAV)-mounted edge devices are particularly advantageous for FEEL due to their flexibility and mobility in efficient data collection. In UAV-assisted FEEL, sensing, computation, and communication are coupled and compete for limited onboard resources, and UAV deployment also affects sensing and communication performance. Therefore, the joint design of UAV deployment and resource allocation is crucial to achieving optimal training performance. In this paper, we address the problem of joint UAV deployment design and resource allocation for FEEL via a concrete case study of human motion recognition based on wireless sensing. We first analyze the impact of UAV deployment on sensing quality and identify a threshold value for the sensing elevation angle that guarantees a satisfactory quality of data samples. Because the sensing channels are non-ideal, we consider a probabilistic sensing model in which the successful sensing probability of each UAV is determined by its position. We then derive an upper bound on the FEEL training loss as a function of the sensing probability. Theoretical results suggest that the convergence rate can be improved if UAVs have a uniform successful sensing probability. Based on this analysis, we formulate a training time minimization problem that jointly optimizes UAV deployment and integrated sensing, computation, and communication (ISCC) resources under a desirable optimality gap constraint. To solve this challenging mixed-integer non-convex problem, we apply the alternating optimization technique and propose the bandwidth, batch size, and position optimization (BBPO) scheme to optimize these three decision variables alternately.
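The alternating-optimization structure of the BBPO scheme can be sketched as a block-coordinate loop like the one below, which cycles through bandwidth, batch size, and position while holding the other two fixed; the toy objective and variable bounds are illustrative assumptions standing in for the training-time expression derived in the paper.

```python
from scipy.optimize import minimize_scalar

# Skeleton of alternating (block-coordinate) optimization in the spirit of BBPO:
# fix two of {bandwidth, batch size, position} and optimize the third, then
# cycle until the objective stops improving. The objective below is a made-up
# stand-in for the training-time bound derived in the paper.
def training_time(b, s, p):
    return (1.0 / b) + 0.01 * s + (p - 3.0) ** 2 + 5.0 / (s * b)

b, s, p = 0.5, 10.0, 0.0
for _ in range(20):
    b = minimize_scalar(lambda v: training_time(v, s, p), bounds=(0.1, 1.0), method="bounded").x
    s = minimize_scalar(lambda v: training_time(b, v, p), bounds=(1.0, 64.0), method="bounded").x
    p = minimize_scalar(lambda v: training_time(b, s, v), bounds=(-10.0, 10.0), method="bounded").x
print(f"bandwidth={b:.2f}, batch size={s:.1f}, position={p:.2f}, time={training_time(b, s, p):.3f}")
```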


Benchmarking GPU and TPU Performance with Graph Neural Networks

Ju, Xiangyang, Wang, Yunsong, Murnane, Daniel, Choma, Nicholas, Farrell, Steven, Calafiura, Paolo

arXiv.org Artificial Intelligence

Many artificial intelligence (AI) devices have been developed to accelerate the training and inference of neural network models. The most common ones are the Graphics Processing Unit (GPU) and the Tensor Processing Unit (TPU). They are highly optimized for dense data representations. However, sparse representations such as graphs are prevalent in many domains, including science. It is therefore important to characterize the performance of available AI accelerators on sparse data. This work analyzes and compares GPU and TPU performance in training a Graph Neural Network (GNN) developed to solve a real-life pattern recognition problem. Characterizing this new class of models acting on sparse data may prove helpful in optimizing the design of deep learning libraries and future AI accelerators.
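A benchmarking harness of this kind typically boils down to timing warmed-up training steps with the accelerator synchronized before each clock read; the sketch below shows that pattern in PyTorch with a stand-in dense model, whereas the actual study times a GNN (and a TPU backend would synchronize differently).

```python
import time
import torch

def time_training_steps(model, data, target, steps=50, warmup=5):
    """Minimal timing harness of the kind used to compare accelerators:
    run a few warm-up iterations, then average the wall-clock time of a
    forward/backward/update step. On a GPU, synchronize before reading the
    clock so asynchronous kernels are fully counted."""
    optimizer = torch.optim.Adam(model.parameters())
    loss_fn = torch.nn.MSELoss()
    def step():
        optimizer.zero_grad()
        loss_fn(model(data), target).backward()
        optimizer.step()
    for _ in range(warmup):
        step()
    if torch.cuda.is_available():
        torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(steps):
        step()
    if torch.cuda.is_available():
        torch.cuda.synchronize()
    return (time.perf_counter() - start) / steps

# Illustrative usage with a stand-in dense model; the benchmark itself times a GNN.
model = torch.nn.Sequential(torch.nn.Linear(16, 64), torch.nn.ReLU(), torch.nn.Linear(64, 1))
x, y = torch.randn(1024, 16), torch.randn(1024, 1)
print(f"avg step time: {time_training_steps(model, x, y) * 1e3:.2f} ms")
```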